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Abstract 

Background: The lack of genomic resources can present challenges for studies of non-model organisms. 
Transcriptome sequencing offers an attractive method to gather information about genes and gene expression 
without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the 
transcriptome de novo for these purposes. 

Results: We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, 
Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine 
how read count affects the assembly. This included an assembly of mouse heart reads because we could compare 
those against the reference genome that is available. We found qualitative differences in the assemblies of 
whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and 
discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the 
assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, 
assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. 

Conclusions: These assembly trends suggest that representative assemblies are generated with as few as 20 million 
reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good 
balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing 
errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an 
exception and possibly require alternate assembly strategies. 



25 Background 

26 RNA-seq has provided a powerful tool for analysis of 

27 transcriptomes. For non-model organisms with limited 

28 genomic information, transcriptome sequencing provides 

29 a cost-saving tool by only sequencing functional and 

30 protein coding RNAs, thus providing direct information 

31 about the genes [1]. There are many benefits of sequencing 

32 a genome, but for relatively large genomes such as human 

33 and mouse, protein coding regions account for under 5%, 
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thus most of the sequencing effort would go to sequenc- 34 

ing either regulatory regions or repetitive elements [2]. 35 

Smaller genomes could be sequenced and assembled to 36 

complement the transcriptomes, though this is not a 37 

tractable approach if a genome is quite large. Even still, de 38 

novo genome assembly can produce errors by itself [3]. 39 

Despite its advantage, transcriptome assembly does 40 

present additional challenges when compared to genome 41 

assembly. Unlike genomes where most sequences should 42 

be approximately equally represented, coverage of any 43 

given sequence in a transcriptome can vary over sev- 44 

eral orders of magnitude due to expression differences 45 

[4]. Because coverage can vary, there is also a question 46 

of sequencing depth. Theoretically, there is a sequenc- 47 

ing depth beyond which addition of more reads does not 48 

provide new information, known as the saturation depth. 49 
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50 Several studies have used approaches which map reads 

5 1 onto reference genomes and these have suggested satura- 

52 tion depths at 95% gene coverage ranging from 1.2 million 

53 reads to 50 million for mRNA level coverage, and up to 

54 700 million for splice variants [5-7]. However, these stud- 

55 ies all made use of short reads around 36bp and were not 

56 assembling the transcriptomes de novo, 

57 Several recent studies have already made use of next- 

58 generation sequencing reads for de novo transcriptome 

59 assembly [8-15]. The number of reads used for assembly in 

60 these studies varies widely, ranging from 2.6 million reads 

61 up to 106 million reads [10,11]. The assembly strategies 

62 are equally varied, but share the initial step of removing 

63 low-quality reads and adapters whereupon all remaining 

64 reads are assembled. The assembly quality estimates vary 

65 as well with the most common measure of quality based 

66 on BLAST hits to public databases like Uniprot, though 

67 it was noted that under-representation of many taxa in 

68 public databases limits this approach [8]. 

69 While many parameters must be optimized for the spe- 

70 cific assembly, it is both inconvenient and costly to acquire 

71 more reads by resequencing. Presently, there is no clear 

72 consensus of what sequencing depth is optimal or what 

73 factors would contribute to the adequate depth. The prob- 

74 lems of omitted genes or variants are obvious with too few 

75 reads. On the other hand, it was suggested that greater 

76 depth may create errors in differential expression analy- 

77 ses, cost more, and take longer to assemble [16]. Thus, 

78 here we use the same assembly strategy across a diverse 

79 set of organisms to isolate the effects of read count on 

80 assembly quality to attain a general estimate of optimal 

81 read count. We compare trends from de novo assemblies 

82 across six phyla. These animals include the mouse (used as 

83 a control for the non-model samples), the Humboldt squid 

84 Dosidicus gigas, the scaleworm Harmothoe imbricata, 

85 the decapod Sergestes similis, the copepod Pleuromamma 

86 robusta, the ctenophore Hormiphora calif or nensiSy and 

87 the siphonophore Chuniphyes multidentata. To our 

88 knowledge, this is the first study to suggest an optimal 

89 number of reads for de novo assembly for the purposes 

90 of mRNA level analysis. These results are applicable to 

91 studies of organisms with limited genomic resources. 

92 Results and discussion 

93 De novo assembly of transcriptomes 

94 Assembly of mouse heart transcriptome 

95 Raw mouse-transcriptome reads from the ENCODE 

96 project were downloaded from NCBI short- read archive. 

97 Sample SRR453174 (mouse heart RNA-seq) consisted of 

98 82,886,668 x76bp reads as paired-ends. Filtration (see 

99 Methods) removed 11.7% of the reads, almost 95% of 

100 which were due to low quality scores. In order to exam- 

101 ine the role of number of reads on the assembly, we 

102 computationally sub-sampled randomized sets from the 



original library. It is suggested that sequencing of very 103 

small numbers of reads can be most subject to biases 104 

and that cDNA normalization can improve the unifor- 105 

mity of the library at low numbers of reads [17]. Such 106 

an approach might be quite costly, and the computational 1 07 

sub-sampling approach has the advantage of drawing from 1 08 

the largest pool of reads and avoid biases which could 109 

occur at low numbers of reads. Subsets of the filtered llO 

library were generated containing 1,5,10,20,30,40,50,60, ill 

and 70 million reads. Reads from each set were included 112 

in the next largest set, thus all of the reads in the 1 million 1 1 3 

set are included in the 5 million read set, and so forth. 1 1 4 

These sets were assembled with Velvet/Oases [18,19] and 115 

Trinity [20] (For a detailed comparison of assemblers, 116 

see [21]). 117 

Schulz et ai. reported reliable parameters for Oases 118 

which produced high-quality assemblies of mouse and 1 1 9 

human cell cultures, using 64 million and 30 million reads, 1 20 

respectively [19]. This included use of a broad k-mer range 1 21 

with a low starting k-mer of 19 or 21 up to a k-mer of 33 122 

or 35. Accordingly we used k-mers from 21 to 33. Also, a 123 

minimum k-mer coverage is required by Oases to retain 1 24 

any given node during the assembly process; by default 125 

this is 3 in Oases, that is, any node must have at least 126 

three-fold coverage for that node to be used. Some differ- 1 27 

ences were observed in the output when this parameter 1 28 

was changed, and so the same data were assembled with 1 29 

coverage cutoff of 3 (referred to hereafter as C3) and a 130 

stricter cutoff of 10 (CIO). 131 

The number of transcripts (Oases terminology for con- 1 32 

tigs) increases steadily for all assemblies (Figure lA). CIO 133 Fl 

also had substantially fewer transcripts and accordingly 134 

much higher mean and median lengths (Figure IB-D). 135 

The pattern of increase for median and N50 (length for 136 

which half of the total bases are in contigs of this length or 1 37 

longer) tracked the mean for the CIO assembly, but not the 1 38 

C3 assembly which did not have a clear qualitative pattern. 1 39 

The mean, median and N50 were all lower for the Trinity 1 40 

assembly than the C3 despite having far fewer contigs. 1 41 

Oases generates transcript "loci", which is Oases ter- 142 

minology for the de-Brujin graph clusters meant to rep- 143 

resent genes and their splice variants or highly- similar 144 

paralogs. Both curves approach to a plateau for locus 145 

counts (Figure lE-F). The greatest increase in loci was 146 

between using 10 million to 20 million reads for both 147 

C3 and CIO. Similarly, the C3 assembly shows a decrease 148 

in the number of transcripts per read (Figure IG), while 149 

the CIO assembly shows an almost constant number of 150 

transcripts per read. The number of transcripts increases 151 

while the number of loci tend to level off and this means 1 52 

the number of transcripts per locus always increases with 1 53 

more reads (Figure IH). That is, on average, more variants 1 54 

will be generated with more reads even though some of 1 55 

these are likely due to noise. While the Trinity assembly 1 56 
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Figure 1 Assembly metrics for mouse heart transcriptome. Assorted size metrics for tine mouse lieart transcriptome sliowing (A) number of 
transcripts; (B) mean lengtli; (C) median lengtli; (D) N50 of tine assembly; (E) number of loci; (F) loci per million reads; (G) transcripts per million 
reads; (H) transcripts per locus. 



157 more closely matches the trends for transcripts per read 

158 of the C3, the "components" (closest obvious parallel of 

159 loci) remain close to a unit ratio, suggesting that most 

1 60 components have only one associated sequence. 

1 61 Assembly of invertebrate transcriptomes 

1 62 Transcriptomes across a broad range of taxa were assem- 

1 63 bled as with the mouse and statistics of the largest assem- 

164 blies are presented in Table 1. The stated GC content of 

165 the mouse genome is 42% while a subset of conserved 

166 genes showed a much higher value of 51.24% [22,23]. 

1 67 Interestingly, for all assemblies except for mouse, the aver- 

168 age GC content of the assembled contigs was lower than 

169 that of the raw reads (Figure 2), suggesting either that 

170 certain genes contribute much more to the overall GC 

171 content of the library or that biases can be introduced 

1 72 from the assembly. 

173 For three of six samples {D.gigas, HAmbricata and 

174 S.similis), only select tissues were used for RNA extrac- 

175 tion while the rest were whole body {C.multidentata, 

176 H. calif or nensis and P.robusta, It should be noted that 

177 the Cmultidentata sample combined sequences from 

178 the two major tissues, siphosome and nectophore and 

179 that the Rrobusta sample was a combination of multiple 



individuals. This decision was based on size of the ani- 
mals since very small organisms are difficult to dissect. 
Assembly trends analogous to Figure 1 for the six animals 
are shown in Figure 3. Mouse CIO data from Figure 1 are 
shown in gray as reference. Three main trends emerged. 
Whole-body samples were characterized by a rapid gain 
of transcripts and increases in transcript size through 40 
million reads, while all other parameters level off after 
40 million reads. Single tissue samples showed a slow 
gain of relatively long transcripts across fewer loci. Lastly, 
the whole-body siphonophore showed continuous gain 
of both short transcripts and loci without reaching an 
asymptote at the maximum number of reads assembled. 

Four of the animals showed modest gains in mean, 
median and N50 with more reads (average 20% from 
fewest to most reads), while Rrobusta and H. calif or nensis 
nearly doubled from the fewest to the most reads 
(Figure 3B-D). Most of the transcript-length increase 
occurred before 30 million reads, suggesting that adding 
more reads did not produce longer sequences beyond 
that threshold, or that they became longer at the same 
rate that new, short transcripts were generated. As with 
the mouse samples, transcripts were added continually 
with more reads (Figure 3A). Compared to the mouse. 
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Table 1 Assembly Statistics 



tl.2 


Organism 


Mouse 
cov-cutoff-3 


Mouse cov- 
cutoff-10 


Mouse- 
Trinity 


Chuniphyes 
multidentata 


Sergestes 
similis 


Pleuromamma 
robusta 


Dosidicus 
gigas 


Hormiphora 
californensis 


Harmothoe 
imbricata 


tl.3 


Phylum 


Chordata 


Chordata 


Chordata 


Cnidaria 


Arthropoda 


Arthropoda 


Mollusca 


Ctenophora 


Annelida 


tl.4 


Tissue 


Heart 


Heart 


Heart 


Whole body 


Legs 


Whole body 


Mantle 


Whole body 


Scale 


tl.5 


Raw Reads 


82,886,668 


82,886,668 


82,886,668 


103,415,276 


93,597,558 


64,116,306 


60,661,588 


64,675,964 


75,608,018 


tl.6 


Raw GC (%) 


51.90 


51.90 


51.90 


42.29 


50.74 


48.86 


39.89 


53.71 


41.52 


tl.7 


Filtered Reads 


73,187,048 


73,187,048 


73,187,048 


1 02,366,438 


92,423,904 


63,867,922 


56,264,099 


57,583,204 


70,340,105 


tl.8 


Assembled Reads 


70,000,000 


70,000,000 


70,000,000 


80,000,000 


80,000,000 


63,867,922 


56,264,099 


57,583,204 


70,340,105 


tl.9 


Transcripts 


254,215 


62,353 


85,294 


338,254 


107,082 


196,104 


86,897 


175,701 


191,290 


tl.lO 


Total Length (Mbp) 


293.55 


98.84 


79.12 


314.99 


159.59 


240.05 


143.09 


272.23 


216.66 


tl.ll 


Mean (bp) 


1,154 


1,585 


927 


931 


1,490 


1,224 


1,646 


1,549 


1,132 


tl.l2 


Median (bp) 


547 


1,119 


421 


421 


837 


855 


1,026 


1,153 


689 


tl.13 


N50(bp) 


2,364 


2,447 


1,828 


1,854 


2,803 


1,993 


2,876 


2,373 


1,949 


tl.l4 


Oases Loci 


77,41 1 


20,889 


70272 


49,831 


18,139 


22,385 


14,227 


1 7,960 


21,914 


tl.l5 


GC (%) 


54.08 


53.95 


53.46 


31.24 


44.66 


45.78 


36.55 


51.66 


40.53 


tl.l6 


Summary statistics of the largest transcriptome assembly for each organism. 
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Mouse-C3 C.multidentata S.similis P.robusta 
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Figure 2 Histograms of GC distributions. Dashed lines show the normalized abundance of transcripts by GC content, while solid lines show 
normalized abundance of the raw reads. 
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• c.multidentata • H.califomensis 




• S.similis 


• D.gigas 
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Figure 3 Assembly metrics for marine organisms. Assorted size metrics as in Figure 1 ; (A) number of transcripts; (B) mean length; (C) median 
length; (D) N50 of the assembly; (E) number of loci; (F) loci per million reads; (G) transcripts per million reads; (H) transcripts per locus. Purple - C. 
multidentota; blue - H. californensis; teal - P. robusta; green - D. gigas; yellow - H. imbricata; red - S. similis. For comparison, CI 0 mouse data are shown 
in gray. 
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204 on average these six animals all had more transcripts 

205 per locus (Figure 3H). It is unclear why this would 

206 be the case, though the CIO assembly had the fewest 

207 number of transcripts overall for all numbers of reads. 

208 The most pronounced gains in loci happened within 

209 the first 10 million reads, particularly for Rrobusta and 

21 0 H. calif or nensis (Figure 3E-F). Gains in loci tended to level 

21 1 out between 40 and 60 million reads, suggesting most 

212 genes (or parts of genes) were assembled by 60 million 

213 reads. 

214 A very high number of transcripts for Cmultidentata 

215 (Figure 3, purple) led to the lowest mean, median, and 

216 N50. The number of removed, low-quality reads is com- 

21 7 parable in this sample to others, so low quality is unlikely 

218 to be the cause. As two sets of reads were combined 

219 into a whole animal, this may have created artifacts. 

220 However, another Cmultidentata siphosome sample pro- 

221 duced assemblies with large numbers of relatively short 

222 sequences (data unpublished). One possible explanation 

223 is that siphonophores have continuously developing dif- 

224 ferentiated zooids [24]. These zooids have specialized 

225 functions which are in some ways analogous to organs, 

226 and a whole organism can contain multiple developmental 

227 stages and express a large part of the genome, possi- 

228 bly confounding the assembly process. Assemblies of a 

229 number other siphonophores (data unpublished) similarly 

230 had many short transcripts. We speculate that alternate 

231 assembly strategies or very careful dissections might be 

232 required for animals in this lineage. 

233 Discovery of conserved genes 

234 Conserved mouse genes 

235 One approach used to assess genome completeness is to 

236 search only for conserved eukaryotic orthologous genes 

237 (KOGs). The current NCBI KOG database has 860 gene 

238 clusters across 7 eukaryotes with over 16000 proteins 

239 [25]. The KOG reference genes did not include mouse 

240 sequences, and this provided an opportunity to test pre- 

241 dictions about de novo transcriptome quality while still 

242 having a reference in the end to confirm the reliability 

243 of the sequences. For each KOG, the transcripts were 

244 aligned against the reference KOGs with tblastn, and the 

245 best coding sequence was kept. The putative proteins were 

246 classified by length relative to the range of sizes of the 

247 reference KOGs. The size range allowed some flexibility, 

248 as 12 mouse proteins were larger than the longest ref- 

249 erence protein for that KOG, and 5 were shorter than 

250 the shortest reference protein. Finally the proteins were 

251 aligned with blastp against reviewed mouse proteins in 

252 Uniprot to determine accuracy. One protein was unre- 

253 viewed (Q3UWL8, Mouse Prefoldin 4). For this test, Trin- 

254 ity and Oases are comparable at assembling full-length 

255 proteins, though Trinity appears to be slightly better at 

256 reconstructing canonical proteins (Figure 4A). 



However, gene duplications present difficulties for such 257 

assessments unless one had a priori knowledge of how 258 

many copies should be present in the genome. For this 259 

study, we also used the subset of eukaryotic KOGs con- 260 

taining 248 genes from the CEGMA pipeline which 261 

were identified as single-copy orthologs in most genomes 262 

[26,27]. Almost one third of these KOGs are involved 263 

in processes like transcription and translation and were 264 

expected to be expressed in many tissues. Trinity and 265 

Oases with a lower coverage cutoff of 3 found simi- 266 

lar numbers of KOGs at much lower numbers of reads 267 

(Figure 4B) than compared to the CIO assembly. Also 268 

more KOGs were found within expected length much 269 

faster with C3 than with the higher cutoff of 10, and 270 

the Trinity assembly outperformed both of these. These 271 

results suggest that it is better to have a lower cutoff and 272 

assemble more sequences. Likewise, the Trinity assem- 273 

bly had more transcripts than CIO and were shorter than 274 

those in C3, yet more KOGs were found with fewer 275 

reads and more coding transcripts were correctly assem- 276 

bled at greater numbers of reads. However, for the Oases 277 

assemblies this had remarkably little effect on the number 278 

of correct canonical proteins that were found (Figure 4, 279 

triangles). Although there is some overestimation, no pro- 280 

tein designated as too short was ever correct. Regarding 281 

the fate of the other full-length proteins, for C3 at 70 mil- 282 

lion reads, 186 KOGs were found within the expected 283 

range, though only 131 were correct. Eight of the 186 284 

KOGs had only 1 mismatch in the amino-acid sequence 285 

compared to the reference protein which could be due 286 

to errors, splice variants, tissue-specific modifications or 287 

alleles. The remaining KOGs had at least two amino- 288 

acid changes but were within the size range. Thus for 289 

the mouse, the size range was a reliable predictor of true 290 

full-length proteins. 291 

Conserved in vertebrate genes 292 

We then examined our invertebrate transcriptomes for 293 

completion using the same set of KOGs. There was a 294 

clear, qualitative difference between whole-body organ- 295 

isms (Figure 5A) and dissected tissues (Figure 5B). CIO 296 F5 

mouse data are included for reference. For whole-body 297 

transcriptomes, over 90% of the KOGs were detectable at 298 

20 million reads, yet the number of within-length KOGs 299 

went down with higher numbers of reads past 20 mil- 300 

lion. This could be caused if proteins declared to be 301 

within-range were longer than the true protein due to mis- 302 

assembly causing addition of pieces, or if the true protein 303 

became mis-assembled with addition of noisy reads. In 304 

nearly all of our assemblies, it was the latter: mis-assembly 305 

of the putative protein which generated stop codons. 306 

Cmultidentata (Figure 5 A, purple) was again exceptional, 307 

as the number of within-length KOGs increased more 308 

slowly with addition of more reads than the other two 309 



Francis eta/. BMC Genomics 20^ 3, 14:167 
http://www.bionnedcentral.conn/l 471 -21 64/1 4/1 67 



Page 7 of 1 1 



30 40 50 
Reads (Millions) 




30 40 50 
Reads (IVlillions) 



Samples 

i Mouse-C10 • Mouse-C3 



Figure 4 Conserved genes in tiie mouse transcriptome. Saturation curves of discovery of genes in tine mouse lieart from a set of (A) 860 

conserved ortliologs from NCBI and (B) a subset of 248 conserved ortliologs; genes wliicli liave any blast liit are sliown in circles; genes which the 
translated protein was within the expected size range of the conserved gene are in squares; proteins which are 100% identical to a canonical 
protein in Uniprot/Swissprot mouse database are shown in triangles. 



310 whole-body animals [H. calif or nensis and Rrobusta) and 

311 only decreased after 50 million reads rather than 20 

312 million, 

313 For dissected-tissue transcriptomes {Dosidicus gigas, 

314 Harmothoe imbricata, and Sergestes similis), the rate 

315 of discovery of KOGs was much slower with between 

316 63% and 81% of KOGs detectable at 20 million reads 

317 (Figure 5B). This was not surprising since those genes 

318 may not be highly-expressed in all tissues and it is likely 

319 tissue-specific genes account for the bulk of the assem- 

320 bly at low numbers of reads. Isolated tissues may express 

321 fewer universal KOGs that we selected in our test, and 



we expected that other abundant transcripts should mis- 322 

assemble at high numbers of reads in that tissue. However, 323 

the dissected-tissue transcriptomes had longer transcripts 324 

and fewer loci, suggesting this was not the case. Since 325 

whole-animal transcriptomes include all tissues, a greater 326 

proportion of the genome is expressed so coverage of any 327 

given transcript or splice-variant is proportionally much 328 

lower. The length saturation patterns appear to be dif- 329 

ferent between whole-animal and tissue transcriptomes. 330 

However, using conserved genes as a metric, there appears 33 1 

to be limited benefit of sequencing beyond 60 million 332 

reads. 333 
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Figure 5 Conserved genes in marine organisms. As in Figure 4, genes with a reliable blast hit are shown in circles for all 6 marine organisms; 
genes which the translated protein was within the expected size range of the conserved gene are in squares. Purple - C. multidentota; blue - H. 
californensis; teal - P. robusta; green - D. gigas; yellow - H. imbricata; red - 5. similis. For comparison, CI 0 mouse data are shown in gray. 
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334 Mis-assembly at high numbers of reads 

335 KOGs with single-exon coding sequences in the mouse 

336 were examined for mis-assembly. To increase the num- 

337 ber of genes examined, another set of KOGs from only 

338 metazoans {C.elegans, D.melanogaster and Ksapiens, 

339 CDH) was used. The KOG database at NCBI con- 

340 tained 1147 clusters common to CDH. Again, only 

341 genes that were annotated as single copy in all three 

342 animals were used, leaving a final set of 202 KOGs 

343 specific to metazoans. These combined sets of 450 

344 had 12 genes in mouse which were presumed single- 

345 copy and annotated in NCBI to have a single-exon 

346 coding sequence (GenBank:NP_062724.1, NP_666327.2, 

347 NP_082281.2, NP_058612.3, XP_899832.1, NP_001153802.1, 

348 NP_001104758.1, NP_077152.1, XP_486217.2, NP_598737.1, 

349 NP_032025.2, NP_075969.1). At 70 million reads, 3 genes 

350 in C3 had alternate erroneous coding sequences: 

351 NAT6, CHMP1B1/DID2, FTSJ (N-acetyl transferase 6, 

352 Charged multivesicular body protein lb-1, Ribosomal 

353 RNA methyltransferase, respectively). The sequence of 

354 CHMPIBI was never assembled correctly for any number 

355 of reads and the best version was missing 9 amino acids 

356 at the N-terminus including the start codon. Only NAT6 

357 had extraneous coding sequence in CIO, suggesting that 

358 such errors can be controlled by limiting read count as 

359 well as increasing k-mer coverage thresholds. 

360 While some mis-assemblies can occur with more reads, 

361 overall this is not a problem, as shown by the curves in 

362 Figures 4 and 5. However, select cases of mis-assembly of 

363 the mouse genes are shown in Figure 6. AlaRS (Alanyl- 

364 tRNA synthetase) presents an example of the optimal 

365 scenario, whereby the protein is not found at all with few 

366 reads, but then pieces come together with the addition 

367 of more reads until the final protein is correctly assem- 

368 bled. The majority of proteins follow this trend. 2-OGDH 

369 shows an unusual oscillation between the reference pro- 

370 tein and alternate forms. EF2 is assembled correctly with 

371 few reads, then errors accumulate as more reads are 

372 added. From this, it cannot be assumed that the largest set 

373 of reads will produce the best contigs. Schulz et aL indi- 

374 cated that between 10 and 20% of Oases transcripts had 

375 some degree of misassembly [19]. This value was found 

376 to correlate with the smallest k-mer used in assembly and 

377 the authors suggest using larger k-mers if problems arise 

378 due to chimeric transcripts. Thus if using more reads, it 

379 may be advisable to use larger k-mers or a higher static 

380 coverage cutoff. 

381 Conclusions 

382 In this study, number of whole animals and tissues from 

383 non-model organisms and one mouse organ were assem- 

384 bled and the completeness was assessed using a set of 

385 conserved genes. Additionally, a comparison was made 

386 between two high-performing assemblers with respect to 



the mouse data. Oases required much greater memory 387 

usage while Trinity had much longer run times (approx- 388 

imately 2-fold longer). Both Trinity and Oases perform 389 

comparably at assembling conserved genes across a large 390 

set, indicating that the saturation depth is not greatly 391 

affected by assembler choice. 392 

Overall, these results suggest that for whole-body tran- 393 

scriptomes and individual organs or cells, 30 and 20 394 

million reads are sufficient for mRNA level coverage, 395 

respectively. For the read length used in this study, that 396 

would produce 2-3 gigabases of sequence. It should be 397 

noted that the mouse data consisted of shorter reads than 398 

used for the invertebrates, but this did not appear to have 399 

substantial effect as this difference was only between 75bp 400 

reads and lOObp reads. Assembly errors are evident in 401 

whole-body transcriptomes after 30 million reads, and the 402 

average length appeared to level off at the same depth. 403 

Presumably this depth would apply for studies of differen- 404 

tial expression as well, as the highly expressed transcripts 405 

should be present and distinguishable at that sequencing 406 

depth. In our experience, we find it is optimal to acquire 407 

between 50 and 60 million reads, and then sub-sample up 408 

around 20 or 30 million. This approach reliably assembles 409 

nearly all proteins of interest. There are still observable 410 

differences between assemblies, although some of these 41 1 

differences may ultimately be due to variations in RNA 412 

quality or properties of the animal. 41 3 

Methods 4i4 

Samples and sequencing 41 5 

D.gigas and H. calif or nensis were collected in the 416 

Gulf of California by jig and trawl net, respectively. 417 

Cmultidentata and S.similis were collected in the 418 

Monterey Bay using remotely-operated-underwater vehi- 41 9 

cles. HAmbricata samples were given courtesy of T. Rivers. 420 

All samples were flash frozen in liquid nitrogen immedi- 421 

ately following collection. Total RNA was extracted using 422 

RNeasy kit (Qiagen) as per instructions. Cmultidentata 423 

RNA was extracted with Trizol and purified with the 424 

RNeasy kit. Preparation of RNA-seq libraries was done 425 

using lUumina TruSeq kit for paired end reads. Total 426 

RNA was sent for sequencing at University of Utah. Mul- 427 

tiple individuals of Rrobusta were sampled off the coast 428 

of Namibia and sequenced at the Institute for Clinical 429 

Molecular Biology, (IKMB, Kiel University). Sequenc- 430 

ing was done using the lUumina HiSeq2000 platform 431 

on a paired-end protocol with 100 cycles. Mouse heart 432 

data were downloaded from NCBI accession GSE36025, 433 

sample SRR453174. 434 

Transcriptome assembly 435 

All computations were done on a computer with two 436 

quad-core processors and 96GB RAM. For each sample, 437 

the orders of all raw reads were randomized with the 438 
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Figure 6 Selected cases of misassembly. Orthologs were tracked across multiple sequencing depths, and selected examples are here showing 
some of the pitfalls of assembly. (A) The lengths of three proteins are shown (AlaRS, Alanyl-tRNA synthetase; 2-OGDH-E2, 2-oxoglutarate 
dehydrogenase subunit E2; EF2, Elongation factor 2), and the canonical protein length is indicated by a blue line. (B) Protein alignment view of the 
same three proteins compared to the Uniprot/Swissprot canonical protein, which is shown as the blue bar. A chimeric portion of AlaRS at 30 million 
reads is indicated by the red bar, where it contains a sequence from the putative mitochondrial alanyl-tRNA synthetase 2 protein (NP_941 01 0), and 
corresponds to the red point at 30 in (A). For AlaRS and EF2, some alignments produced a few short gaps compared to the reference proteins. 



439 randomize.cpp program and processed with a modified 

440 version of the filterJllumina.cpp program in the Agalma 

441 transcriptome package (https://github.com/caseywdunn/ 

442 agalma). This removed low-quaUty reads (with mean 

443 Phred score < 28), as well as reads containing adapters 

444 and reads that were mostly repeated bases, such as polyT 

445 tracts. Reads from pairs with one good read and one bad 

446 read retained the good read for the largest assembly. Oth- 

447 erwise, only good pairs were used in other assemblies. The 

448 transcriptome for each set was assembled de novo using 

449 Velvet vl.2.06 /Oases vO.2.06. Identical assembly param- 

450 eters were used unless otherwise noted. Multiple k-mer 

451 assemblies were generated (21,25,29,33) and merged with 

452 Oases-M (k -mer of 27). A static coverage cutoff of 10 was 

453 used and insert size of the paired ends was estimated with 

454 the "-exp_cov auto" parameter, typically around 180bp, 

455 as expected. The minimum contig length was set to 100, 

456 which is the read length. The Trinity assembler was also 

457 used for comparison of mouse assemblies using the same 

458 filtered subsets of reads. Other than insert length being 

459 specified as the upper limit rather than the mean, default 

460 assembly parameters were used including a minimum 



transcript length of 200bp. Transcript lengths and GC 461 

content were measured with an in-house python script, 462 

sizecutter.py, available at the MBARI public repository 463 

(bitbucket.org/beroe/mbari-public/src). 464 

Conserved gene analyses 465 |Q2| 

All blast searches were done using the NCBI blast 2.2.25+ 466 

package [28]. We generated a script to blast and ana- 467 

lyze the matches, kogblaster.py (on the public repository, 468 

as above). Briefly, the reference KOGs (860 orthologous 469 

groups from NCBI, or 248 orthologous groups, from 470 

http://korflab.ucdavis.edu/Datasets/cegma/) were aligned 471 

to each assembly with tblastn with an e-value cutoff of 472 

10~^. For each alignment, the subject hit was translated 473 

and coding sequences were only kept if they contained 474 

both start and stop codons. From this subset, the best 475 

alignment was declared to be the correct sequence. Next, 476 

the length of the correct sequence was used to estimate 477 

whether that sequence was full-length relative to the con- 478 

served orthologs. For each KOG in the CEGMA dataset, 479 

there were 6 proteins from 6 species and there was some 480 

variability in protein length (average 11.8% from longest 481 
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482 to shortest). The variabiUty from the the reference set was 

483 used to establish boundaries for size classifications which 

484 were made to watch the progression of assembly of indi- 

485 vidual genes: (1) within the size range of the KOG; (2) 

486 within the range but where the alignment was less than 

487 90% of the length of the protein; (3) longer than those in 

488 the size range; (4) shorter than the size range; (5) shorter 

489 than the size range and shorter than the alignment, often 

490 indicative of a stop codon bridged by the alignment. The 

491 full-length size range was defined by ratios of the short- 

492 est protein to the second shortest, and analogously for 

493 the longest protein and second longest. For example, if 

494 the shortest protein within a KOG was 80 A As, and the 

495 second shortest was lOOAAs, the lower bound would be 

496 (80 * (80/100)), and thus 64AAs. This was calculated for 

497 each KOG, and was to account for proteins which could 

498 potentially become the 'new' shortest or longest. Ulti- 

499 mately, only those within the size range (1) were declared 

500 as full-length sequences. 

501 The animals in this study were treated ethically and 

502 responsibly. Because no vertebrates or octopus were 

503 involved, no formal certification is required per the 

504 Helsinki Declaration. The mouse data presented in the 

505 paper were not obtained from our experiments, but were 

506 downloaded from a database. 
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